Presentation: "Data Science at Scale with Spark"
Apache Spark has been blessed as the replacement for MapReduce in Hadoop environments, and it also runs in other deployment modes. Compared with MapReduce, Spark offers better performance and better developer productivity, and it supports a wider range of application scenarios, including event stream processing, ad hoc queries, graph processing, and iterative algorithms. Graphs are a natural way to represent many data sets, such as social media networks, and iterative algorithms are important for machine learning tasks such as model training with gradient descent.
This talk discusses Spark from a Data Science perspective: its strengths and weaknesses, the Scala, Java, Python, and R APIs it offers for common analytics problems, what's missing, and what's planned. We'll look at support for ad hoc queries over large data sets, machine learning algorithms, graph processing, the programmer experience, and the pragmatic concerns of running applications.